Recap of Data Wrangling with dplyr

STAT 331

Ugliest Plot

https://docs.google.com/presentation/d/19u5djgMsPLtxoM-rfAQuLyP89B4nYdW8uBhNAEJQoS8/edit?usp=sharing

Serif fonts “have an extra flourish that makes it look pretty for many people, but can clutter what is on the page and that’s what makes it harder to distinguish for people with visual disabilities than just having a very clean font with no extra bits and pieces around it.”

Changing Font in ggplot2

Option 1: Using your system’s fonts (windowsFonts() is available on Windows only)

windowsFonts()
$serif
[1] "TT Times New Roman"

$sans
[1] "TT Arial"

$mono
[1] "TT Courier New"

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, 
                     y = bill_depth_mm)) +
  geom_point() +
  labs(title = "Penguins from the Palmer Archipelago",
       x = "Bill Length (mm)",
       y = "Bill Depth (mm)") + 
  theme(plot.title   = element_text(family = "sans", size = 28),
        axis.title.x = element_text(family = "serif", size = 28),
        axis.title.y = element_text(family = "mono", size = 28)
        )

Changing Font in ggplot2

Option 2: showtext R package

library(showtext)

font_add_google("Gochi Hand", "gochi")
font_add_google("Montserrat", "montserrat")
showtext_auto()  # draw text with showtext so the added fonts are used

ggplot(data = penguins, 
       mapping = aes(x = bill_length_mm, 
                     y = bill_depth_mm)) +
  geom_point() +
  labs(title = "Penguins from the Palmer Archipelago",
       x = "Bill Length (mm)",
       y = "Bill Depth (mm)") + 
  theme(plot.title = element_text(family = "montserrat", size = 40),
        axis.title.y = element_text(family = "gochi", size = 40),
        axis.title.x = element_text(family = "gochi", size = 40)
  )

The tidyverse Philosophy

A Vignette

subset()

Return subsets of vectors, matrices or data frames which meet conditions.

The subset argument states how the rows of the data frame should be filtered.

subset(surveys, 
       subset = 
         species_id == "DS")

The select argument states which columns should be selected from the data frame.

subset(surveys, 
       subset = species_id == "DS", 
       select = c(weight, 
                  hindfoot_length)
       )

We want functions that accomplish one task!

We want functions with intuitive names!
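
As a rough illustration of that idea, the sketch below repeats the subset() call from above and then splits it into one dplyr verb per task. The surveys_small data frame is a hypothetical stand-in with made-up values, not the real surveys data.

```r
library(dplyr)

# Hypothetical stand-in for the surveys data; the values are made up.
surveys_small <- tibble(
  species_id      = c("DS", "DM", "DS", "PF"),
  weight          = c(120, 43, 117, 8),
  hindfoot_length = c(50, 36, 48, 15)
)

# Base R: one function handles both the rows and the columns
base_result <- subset(surveys_small,
                      subset = species_id == "DS",
                      select = c(weight, hindfoot_length))

# dplyr: filter() picks the rows, select() picks the columns
dplyr_result <- surveys_small |>
  filter(species_id == "DS") |>
  select(weight, hindfoot_length)
```

Both results hold the same two rows and two columns; the dplyr version simply names each task separately.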

Data Wrangling Verbs

filter()

select()

mutate()

summarize()

arrange()

group_by()

Brainstorm definitions for each verb

filter()

select()

mutate()

group_by()

summarize()

arrange()

The Pipe |>
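
A minimal sketch of what the pipe does, assuming R 4.1 or later for the native |> operator. The toy data frame is made up for illustration. The pipe passes the result on its left as the first argument of the function on its right, so the two versions below should give the same result.

```r
library(dplyr)

# Hypothetical toy data, made up for illustration.
toy <- tibble(group = c("a", "b", "a"),
              value = c(1, 2, 3))

# Nested calls read from the inside out
nested <- arrange(filter(toy, group == "a"), desc(value))

# Piped calls read from top to bottom
piped <- toy |>
  filter(group == "a") |>
  arrange(desc(value))
```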

Preview Activity Review

Suppose we would like to study how the ratio of penguin body mass to flipper size differs across the species. Arrange the following steps into an order that accomplishes this goal (assuming the steps are connected with a |>).

arrange(med_mass_flipper_ratio)

group_by(species)

penguins

summarize(med_mass_flipper_ratio = 
            median(mass_flipper_ratio))

mutate(mass_flipper_ratio = 
         body_mass_g / flipper_length_mm)
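
One ordering that accomplishes the goal is sketched below: start with the data, create the ratio, group by species, summarize, then arrange. To keep the sketch self-contained it uses penguins_small, a hypothetical five-row data frame with made-up measurements, rather than the full penguins data. The full data contains a few missing measurements, which this sketch does not deal with.

```r
library(dplyr)

# Hypothetical miniature of the penguins data; the values are made up.
penguins_small <- tibble(
  species           = c("Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"),
  body_mass_g       = c(3800, 3600, 5000, 5200, 4200),
  flipper_length_mm = c(190, 180, 200, 208, 200)
)

ratio_by_species <- penguins_small |>
  # 1. create the ratio for every penguin
  mutate(mass_flipper_ratio =
           body_mass_g / flipper_length_mm) |>
  # 2. form one group per species
  group_by(species) |>
  # 3. collapse each group to its median ratio
  summarize(med_mass_flipper_ratio =
              median(mass_flipper_ratio)) |>
  # 4. order the species from smallest to largest median
  arrange(med_mass_flipper_ratio)
```

The mutate() step has to come before summarize(), because summarize() can only use columns that already exist.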

A Different Context

You have data on each Cal Poly student for the 2020-2021 academic year. You are tasked with reporting how the number of CR/NC courses students took differed based on department.

name                  department         CRNC_f20  CRNC_w21  CRNC_s21
Clarke, Justin        Business                  0         1         1
Hernandez, Jorge      Biology                   1         0         1
Meng, Huy             Business                  1         0         0
el-Munir, Farhaan     Chemistry                 3         0         3
Miller, Marissa       Liberal Studies           0         2         1
Crossley, David       Biology                   1         1         0
Lampe, Bianca         Business                  0         0         1
Padilla, Antonio      Political Science         1         2         1
Tan, Alexandra        Liberal Studies           0         2         1
Venkatesan, Patricia  Political Science         1         1         1
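
To follow along in R, the table above could be entered by hand as a tibble. This is only a sketch: the name students matches the code on the later slides, and storing the counts as integers is an assumption based on the <int> column type shown in the printed output.

```r
library(dplyr)

# The ten rows of the table above, typed in column by column.
students <- tibble(
  name = c("Clarke, Justin", "Hernandez, Jorge", "Meng, Huy",
           "el-Munir, Farhaan", "Miller, Marissa", "Crossley, David",
           "Lampe, Bianca", "Padilla, Antonio", "Tan, Alexandra",
           "Venkatesan, Patricia"),
  department = c("Business", "Biology", "Business",
                 "Chemistry", "Liberal Studies", "Biology",
                 "Business", "Political Science", "Liberal Studies",
                 "Political Science"),
  CRNC_f20 = c(0L, 1L, 1L, 3L, 0L, 1L, 0L, 1L, 0L, 1L),
  CRNC_w21 = c(1L, 0L, 0L, 0L, 2L, 1L, 0L, 2L, 2L, 1L),
  CRNC_s21 = c(1L, 1L, 0L, 3L, 1L, 0L, 1L, 1L, 1L, 1L)
)
```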

Problem Statement:

Department totals for number of CR / NC courses


What data wrangling operations would you use?

What order would you use to accomplish this goal?

Step 1: Get totals for each student

students |> 
  group_by(name) |> 
  mutate(total_CRNC = sum(CRNC_f20, CRNC_w21, CRNC_s21)
         ) 
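
Grouping by name works here because every name is unique, so each group is a single student. A possible alternative, sketched below on a hypothetical three-row excerpt of the table (students_small), is to add the three columns directly, which does not rely on names being unique.

```r
library(dplyr)

# Hypothetical three-row excerpt of the students table.
students_small <- tibble(
  name     = c("Clarke, Justin", "Hernandez, Jorge", "el-Munir, Farhaan"),
  CRNC_f20 = c(0L, 1L, 3L),
  CRNC_w21 = c(1L, 0L, 0L),
  CRNC_s21 = c(1L, 1L, 3L)
)

# Approach from the slide: one group per student, then sum()
by_name <- students_small |>
  group_by(name) |>
  mutate(total_CRNC = sum(CRNC_f20, CRNC_w21, CRNC_s21)) |>
  ungroup()

# Alternative: vectorized addition, no grouping needed
by_addition <- students_small |>
  mutate(total_CRNC = CRNC_f20 + CRNC_w21 + CRNC_s21)
```

Without the group_by(name) step, sum() would add up all three columns across every row and give each student the same grand total.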

Step 2: Get department totals

students |> 
  group_by(name) |> 
  mutate(total_CRNC = sum(CRNC_f20, CRNC_w21, CRNC_s21)
         ) |> 
  group_by(department) |> 
  summarize(dept_total = sum(total_CRNC)
            )

Step 3: Arrange the totals

students |> 
  group_by(name) |> 
  mutate(total_CRNC = sum(CRNC_f20, CRNC_w21, CRNC_s21)
         ) |> 
  group_by(department) |> 
  summarize(dept_total = sum(total_CRNC)
            ) |> 
  arrange(desc(dept_total))
# A tibble: 5 × 2
  department        dept_total
  <chr>                  <int>
1 Political Science          7
2 Chemistry                  6
3 Liberal Studies            6
4 Biology                    4
5 Business                   4

Getting Specific

Often you are interested in one specific summary statistic!

students |> 
  group_by(department) |> 
  count(sort = TRUE) 
# A tibble: 5 × 2
# Groups:   department [5]
  department            n
  <chr>             <int>
1 Business              3
2 Biology               2
3 Liberal Studies       2
4 Political Science     2
5 Chemistry             1

students |>
  group_by(department) |> 
  count() |> 
  filter(department == "Political Science")
# A tibble: 1 × 2
# Groups:   department [1]
  department            n
  <chr>             <int>
1 Political Science     2
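
As a side note, count() can also take the grouping column directly, which should give the same counts without a separate group_by() step and without leaving the result grouped. The sketch below uses departments, a hypothetical one-column data frame holding the department values from the student table.

```r
library(dplyr)

# Only the department column of the student table is needed here.
departments <- tibble(
  department = c("Business", "Biology", "Business", "Chemistry",
                 "Liberal Studies", "Biology", "Business",
                 "Political Science", "Liberal Studies",
                 "Political Science")
)

# Approach from the slide: group first, then count
grouped_counts <- departments |>
  group_by(department) |>
  count(sort = TRUE) |>
  ungroup()

# Shortcut: name the column inside count()
direct_counts <- departments |>
  count(department, sort = TRUE)
```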

A Handy Tool

pull()

  • Extracts a single column from a data frame as a vector

students |>
  group_by(department) |> 
  count() |> 
  filter(department == "Political Science") |> 
  pull(n)
[1] 2
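
To see why pull() is handy, the sketch below compares it with select() on dept_counts, a hypothetical two-row table of counts. select() keeps the result as a data frame, while pull() returns the bare values, which is usually what you want for inline reporting or further arithmetic.

```r
library(dplyr)

# Hypothetical table of counts, shortened to two departments.
dept_counts <- tibble(
  department = c("Business", "Political Science"),
  n          = c(3L, 2L)
)

# select() returns a 1 x 1 tibble
as_tibble_result <- dept_counts |>
  filter(department == "Political Science") |>
  select(n)

# pull() returns a plain integer vector
as_vector_result <- dept_counts |>
  filter(department == "Political Science") |>
  pull(n)
```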

Your Turn: PA Exploration

  • Find your group
  • Introduce yourself
  • Decide on team roles (will change each week)
    • Reporter – Types the solutions
    • Editor – Asks professor team questions, double checks what is typed
    • Facilitator – Leads discussion, makes sure everyone understands the task
    • Captain – Encourages participation, enforces norms, brings conversation back if it deviates